The Property Graph Data Format (PGDF)


This graphical abstract shows that PGDF requires less disk space than YARS-PG, GraphML, and JSON: for each graph data format, the chart illustrates the resulting file sizes.

Abstract:

Property graphs are popular in both industry and academia due to their versatility in modeling complex data across diverse application domains, ranging from social networks to knowledge graphs. Despite their popularity, there is currently no standardized data format for storing and exchanging property graphs. This article introduces PGDF, a text-based data format for property graphs designed to be simple and flexible while remaining expressive and efficient. The simplicity of PGDF comes from its tabular-like structure, where each line in a PGDF file contains a single schema or data declaration. PGDF allows schema and data declarations to be combined in any order, providing great flexibility; this means that nodes and edges can each have distinct properties, offering greater adaptability and customization. The expressiveness of PGDF lies in its ability to represent a wide variety of property graph features. In this article, we describe the syntax and semantics of PGDF, outline a method to transform property graphs stored as multiple CSV files into PGDF and other graph data formats, and present an experimental evaluation comparing PGDF, YARS-PG, GraphML, and JSON-Neo4j. The experiments show that PGDF produces smaller files faster than the other graph data formats.
Published in: IEEE Access, Volume 12
Pages: 159267 - 159279
Date of Publication: 24 October 2024
Electronic ISSN: 2169-3536

Section I.

Introduction

In recent years, property graphs have emerged as a novel paradigm for representing and analyzing complex relationships within data. Property graphs, which consist of nodes, edges and properties [1], have gained immense prominence not only in academic research [2], [3], [4] but also in industry [5], [6]. Their ability to model intricate connections and attributes in a wide range of applications, from social networks [7], [8] to knowledge graphs [9], has made them an indispensable tool for data management [10], analytics [11], and knowledge discovery [12].

As the demand for graph data utilization continues to grow, a noteworthy challenge has emerged: the lack of a standardized and comprehensive data format for storing and exchanging property graphs. Traditional alternatives like CSV (Comma-Separated Values), or even the more structured GraphSON and GraphML formats [13], while suitable for certain use cases, can fall short in capturing the rich and nuanced features of property graphs [1]; for instance, by not explicitly allowing multi-valued properties (e.g., GraphML [13] is XML-based, and even though XML allows repeating tags to encode multiple values for a property, GraphML does not support this, as property names are encoded in the key attribute of the tag, which must be unique within an edge or node).

In response to the aforementioned issues, we have created PGDF, a data format for property graphs which was designed to satisfy the following characteristics: a) Simplicity: PGDF follows an intuitive tabular-like structure such that each line in a PGDF file contains a single schema or data declaration. b) Flexibility: the schema and data declarations can be combined in any order, allowing nodes and edges with distinct properties. c) Expressiveness: PGDF is able to represent a wide variety of property graph features such as multiple labels, multi-valued properties and edge IDs.

Moreover, PGDF offers the following advantages: PGDF has the potential to be generic, unlike JSON-based serializations which are vendor-specific (i.e., a JSON file obtained with Neo4j may not be imported directly in other systems); PGDF can be used to serialize property graphs obtained from popular graph database systems, including Neo4j, Amazon Neptune and TigerGraph; a property graph serialized as multiple CSV files can be stored in a single PGDF file; and, PGDF produces smaller files than other serializations such as GraphSON, GraphML and YARS-PG.

In this article we introduce PGDF, study transformation methods, and present an empirical evaluation. This paper is organized as follows: Section II contains a formal definition of the property graph data model; Section III reviews current data formats for property graphs; Section IV presents the syntax and semantics of PGDF, and an algorithm for producing a PGDF file; Section V includes a method to convert property graphs from CSV to PGDF, including a use-case example based on the LDBC Social Network Benchmark (LDBC-SNB); Section VI explains methods to convert a CSV-based property graph to other graph formats (i.e., YARS-PG, GraphML, JSON); Section VII presents evidence that PGDF is able to produce smaller files faster than other data formats.

Section II.

Property Graphs

A property graph is a directed multi-graph where the nodes and the edges can have labels and properties (i.e., name-value pairs). Figure 1 shows an example of a property graph composed of three nodes and two edges. Node 1 has two labels (PERSON and EMPLOYEE) and three single-valued properties (name, age, and position). Node 2 has label PROJECT, two single-valued properties (name and date) and a multi-valued property (team). Node 3 has label CLIENT, two single-valued properties (name and age), and a multi-valued property (interests). In addition, there are two relationships between the nodes: edge 1001 with label WORKS_ON, and edge 1002 with label CONTRACT. Both edges also have properties.
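To make the example concrete, the graph of Figure 1 can be sketched with plain data structures; this is an illustrative in-memory representation of our own, not part of any format described in this article:

```python
# An in-memory sketch of the property graph in Figure 1.
# Multi-valued properties (team, interests) are stored as lists.
nodes = {
    1: {"labels": ["PERSON", "EMPLOYEE"],
        "props": {"name": "John", "age": 30, "position": "Engineer"}},
    2: {"labels": ["PROJECT"],
        "props": {"name": "Project B", "date": "23/07/01",
                  "team": ["John", "Ana"]}},
    3: {"labels": ["CLIENT"],
        "props": {"name": "Charles", "age": 40,
                  "interests": ["Technology", "Travel"]}},
}
edges = {
    1001: {"label": "WORKS_ON", "source": 1, "target": 2,
           "props": {"hours": 30}},
    1002: {"label": "CONTRACT", "source": 2, "target": 3,
           "props": {"date": "2023/08/20", "amount": 5000}},
}
```

Each serialization format discussed below is essentially a different textual encoding of these same dictionaries.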

Figure 1. Example of a property graph.

We extend the property graph definition presented in [1] to include node and edge identifiers, as well as directionality for edges. For this, we assume the following countably infinite sets: L (node and edge labels), Props (property names), Vals (property values), and VID and EID (node and edge identifiers).

Definition 1:

A property graph G is a six-tuple (V, E, ρ, λ, σ, δ), where:

  • V ⊆ VID is a finite set of nodes;

  • E ⊆ EID is a finite set of edges;

  • ρ: E → V × V is a function that associates each edge in E with a pair of nodes, both in V;

  • λ: V ∪ E → 2^L is a function that associates each node and edge with a set (which may be empty) of labels;

  • σ: (V ∪ E) × Props → 2^Vals is a partial function that connects nodes and edges with property names and property values;

  • δ: E → {→, ←, ↔} is a function that assigns a direction to each edge.

Hence, the property graph shown in Figure 1 can be defined as G=(V,E,ρ,λ,σ,δ) where:

  • V = {1, 2, 3},

  • E = {1001, 1002},

  • ρ = {1001 ↦ (1, 2), 1002 ↦ (2, 3)},

  • λ = {1 ↦ {PERSON, EMPLOYEE}, 2 ↦ {PROJECT}, 3 ↦ {CLIENT}, 1001 ↦ {WORKS_ON}, 1002 ↦ {CONTRACT}},

  • σ = {(1, name) ↦ {John}, (1, age) ↦ {30}, (1, position) ↦ {Engineer}, (2, name) ↦ {Project B}, (2, date) ↦ {23/07/01}, (2, team) ↦ {John, Ana}, (3, name) ↦ {Charles}, (3, age) ↦ {40}, (3, interests) ↦ {Technology, Travel}, (1001, hours) ↦ {30}, (1002, date) ↦ {2023/08/20}, (1002, amount) ↦ {5000}}; and

  • δ is such that δ(1001) = → and δ(1002) = →.

Another concept that shall become important later on is that of the schema of nodes and edges. A property graph schema can be understood in many ways, for instance, as rules defining which kinds of nodes can have which properties and should be connected by a given set of edges [4]. For the purposes of our work, we define the schema on a per-node and per-edge basis, by simply saying that the schema of a node (resp. edge) is the set of properties defined for said node (resp. edge). We define this notion formally as follows.

Definition 2 (Schemas):

G = V,, ρ , λ , σ, δ 是一个属性图。节点的架构 n∈V 是属性名称的集合sch (n)={p(n,p) dom (σ} . 类似地,边的模式 e∈E 是属性名称的集合sch (e)={p(e,p) dom (σ} .我们表示为学校节点 G中所有不同节点模式的集合(即学校节点( G ) = { sch ( n ) n V} )以及学校边缘 G中所有不同边模式的集合(即学校边缘( G ) = { sch ( e ) e E} )。

Let G = (V, E, ρ, λ, σ, δ) be a property graph. The schema of a node n ∈ V is the set of property names sch(n) = {p | (n, p) ∈ dom(σ)}. Similarly, the schema of an edge e ∈ E is the set of property names sch(e) = {p | (e, p) ∈ dom(σ)}. We denote by sch_node(G) the set of all different node schemas in G (i.e., sch_node(G) = {sch(n) | n ∈ V}); and by sch_edge(G) the set of all different edge schemas in G (i.e., sch_edge(G) = {sch(e) | e ∈ E}).

For example, the schema of node 1 in Figure 1 is sch(1) = {name, age, position}, and the schema of edge 1001 is sch(1001) = {hours}.
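Computing these schemas is mechanical: collect the property names defined for each node or edge. A minimal sketch (the in-memory representation of properties as dictionaries is our own choice):

```python
def sch(props):
    """Schema of a node or edge: the set of its defined property names."""
    return set(props)

def sch_node(all_node_props):
    """sch_node(G): the set of all distinct node schemas, as frozensets."""
    return {frozenset(sch(p)) for p in all_node_props.values()}

# Node 1 and edge 1001 from Figure 1.
assert sch({"name": "John", "age": 30, "position": "Engineer"}) \
    == {"name", "age", "position"}
assert sch({"hours": 30}) == {"hours"}
```

The same `sch` function serves for edges, and `sch_edge` would be defined analogously over the edge properties.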

Section III.

Graph Data Formats

There are several data formats for serializing graphs, but just some of them support property graphs. In this section we review the syntax and semantics of GraphML, YARS-PG, PG Format, JSON-Neo4j, and GraphSON Tinkerpop.

A. GraphML

The Graph Markup Language (GraphML) [13] is a file format to store graphs as XML. GraphML has two sections. The first section is the definition of the property names and datatypes of edges and nodes. Before defining the properties of a node, a special property labelV can be used to store node labels. Then, all the node property names are defined. Similarly, the property labelE can be defined for edge labels, followed by all edge property names. In the second section, each node is defined within a <node> tag containing its labels and property values; and then each edge is similarly described within an <edge> tag. Properties are defined with <data> tags specifying the name in the key attribute. Multiple <data> tags with the same key value are not permitted for a node or edge, which prevents the encoding of multi-valued properties. In Figure 2, we present a GraphML file that serializes the data of the property graph shown in Figure 1, where multi-label and multi-valued properties are truncated.
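The two-section structure described above can be illustrated by building a skeletal GraphML-like document programmatically; this is a simplified sketch (XML namespaces, datatype attributes and the full GraphML schema are omitted):

```python
import xml.etree.ElementTree as ET

# Section 1: key definitions (the labelV convention for node labels).
root = ET.Element("graphml")
ET.SubElement(root, "key", {"id": "labelV", "for": "node"})
ET.SubElement(root, "key", {"id": "name", "for": "node"})

# Section 2: node and edge declarations with <data> children,
# where each property name is carried in the "key" attribute.
graph = ET.SubElement(root, "graph", {"edgedefault": "directed"})
node = ET.SubElement(graph, "node", {"id": "1"})
ET.SubElement(node, "data", {"key": "labelV"}).text = "PERSON"
ET.SubElement(node, "data", {"key": "name"}).text = "John"

xml_text = ET.tostring(root, encoding="unicode")
```

Because each property name lives in a `key` attribute that must be unique per node or edge, a second `<data key="name">` under the same node would be invalid, which is exactly the multi-valued-property limitation noted above.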

Figure 2. Example of the GraphML format.

B. YARS-PG

YARS-PG is a data format designed to support the publication and exchange of property graphs. There are two versions of YARS-PG which are presented in [14] and [15] respectively. In this article we consider the first version because its syntax is simpler than the one defined in the second version.

In a YARS-PG file, declarations for nodes must be defined first, and then followed by declarations for edges. A node declaration begins with the node ID, followed by square brackets ([]) that enclose a list of labels for that node separated by colons (:), and ends with curly braces ({}) that contain key/value pairs representing the property names and values of the node. An edge declaration begins with the ID of the source node, enclosed in parentheses. This is followed by a hyphen (-). Within square brackets, one label for the edge is placed, followed by a space, and then the list of property names and values enclosed in curly braces (similar to the case of nodes). Afterwards, an ASCII arrow is appended (->). Finally, the ID of the target node is added in parentheses.
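Following the syntax just described, node and edge declarations can be assembled mechanically. A simplified sketch (YARS-PG's exact quoting and datatype rules are omitted, and all values are rendered as quoted strings):

```python
def yars_node(node_id, labels, props):
    """Node declaration: ID, [label:label:...], then {key:"value",...}."""
    label_part = "[" + ":".join(labels) + "]"
    prop_part = "{" + ",".join(f'{k}:"{v}"' for k, v in props.items()) + "}"
    return f"{node_id}{label_part}{prop_part}"

def yars_edge(src, label, props, dst):
    """Edge declaration: (src)-[label {props}]->(dst), one label only."""
    prop_part = "{" + ",".join(f'{k}:"{v}"' for k, v in props.items()) + "}"
    return f"({src})-[{label} {prop_part}]->({dst})"
```

For instance, `yars_edge(1, "WORKS_ON", {"hours": 30}, 2)` yields `(1)-[WORKS_ON {hours:"30"}]->(2)`, matching the shape described above.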

In Figure 3, we present the YARS-PG declarations for the property graph shown in Figure 1, where multi-valued properties are truncated, as YARS-PG does not explicitly provide support for them (e.g., by defining arrays). Note that YARS-PG supports only one label per edge.

Figure 3. Example of the YARS-PG format.

C. PG Format

PG Format [16] is an exchange format similar to YARS-PG, but with a simpler syntax and allowing multi-valued properties by repeating the key. Each node and each edge is on its own line, with space-separated fields for IDs, labels and property/value pairs. In Figure 4 we present the serialization of the property graph shown in Figure 1, where it can be noted that edge identifiers are lost.

Figure 4. Example of the PG Format.

D. JSON - Neo4j

The JavaScript Object Notation (JSON) can also be used to serialize property graphs. Neo4j defines its own import/export JSON format, where each node and edge is serialized as an object distinguished by the type property. In each object, labels, property names, and values are serialized in the usual JSON way. An example is shown in Figure 5, which presents a JSON file for Neo4j that serializes the property graph of Figure 1.
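The general shape of such objects can be sketched as follows; the field names mirror the style of Neo4j's JSON export but are illustrative here, not a normative Neo4j schema:

```python
import json

# Illustrative node and edge objects distinguished by a "type" field.
node_obj = {"type": "node", "id": "1",
            "labels": ["PERSON", "EMPLOYEE"],
            "properties": {"name": "John", "age": 30}}
edge_obj = {"type": "relationship", "id": "1001", "label": "WORKS_ON",
            "start": {"id": "1"}, "end": {"id": "2"},
            "properties": {"hours": 30}}

# One JSON object per line, as is common for import/export streams.
lines = "\n".join(json.dumps(o) for o in (node_obj, edge_obj))
```

The vendor-specific part is precisely this object structure: another system expecting different field names cannot ingest the file directly, which is the portability issue raised in the introduction.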

Figure 5. Example of the JSON-Neo4j format.

E. GraphSON - TinkerPop3

GraphSON - TinkerPop 3 (TP3) is a data format for serializing property graphs, inspired by JSON and designed to be easy to split and load in distributed systems. In a GraphSON file, each node is serialized as a JSON object that has the id and labels of the node, along with three nested objects. The first nested object contains all edges going out of the node: their ids, labels and properties. The second nested object contains all edges going into the node. The third nested object contains the property names and values of the node. In Figure 6, we present a GraphSON file that represents the property graph shown in Figure 1, where multi-labels are truncated, as these are not explicitly supported.

Figure 6. Example of the GraphSON-TP3 format.

F. Analysis of Data Formats

Table 1 shows a comparison of the data formats described above. We considered different features that can appear in property graphs and state whether each is explicitly supported. It is evident that none of the formats supports all the listed features, with JSON-Neo4j and PG Format being the most complete. It can be argued that some of the formats can be used in such a way that they support some of the missing features, but we report the explicitly supported features as stated in the papers or documentation of each data format.

Table 1. Property graph features supported by the different data formats.

Section IV.

The Property Graph Data Format (PGDF)

In this section we present PGDF, a data format designed to be simple, flexible, expressive and efficient. PGDF is inspired by the simplicity of the CSV data format, but provides the flexibility to accommodate nodes and edges having different schemas. PGDF can express all the features of the property graph model, including multiple labels for nodes and edges, and multi-valued properties.

A PGDF file consists of schema declarations and data declarations, both for nodes and edges. A schema declaration defines the structure of a specific group of nodes or edges. A data declaration defines the data for a node or edge. Note that a data declaration must strictly follow the structure of the preceding schema declaration.

The components of a schema declaration are shown in Figure 7. The schema declaration for a node begins with two reserved keys, @id and @label, which are separated by a pipe symbol (|). Following this, the user-defined property names are listed, also separated by pipes. The schema declaration for an edge is similar to the one for a node, but allows three additional keys: @dir, @out and @in. These attributes are used to indicate the type of the edge (directed or undirected), the source node, and the target node, respectively.

Figure 7. Structure of PGDF schema and data declarations.

The components of a data declaration are shown in Figure 7b. A data declaration starts with the identifier (ID) of the node or edge; the ID is optional for edges. Then, the labels for the node or edge must be provided, either as a single label in the form of a String, or as multiple labels in the form of an Array. For a node, the data declaration ends with its property values. A property value can either be a single-value String or a multi-value Array. For an edge, the next step is to specify its type (T for directed, or F for undirected), followed by the ID of the source node and the ID of the target node. The declaration ends with the properties of the edge. As defined for a schema declaration, the components of a data declaration are separated by a pipe symbol.
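Putting the two shapes together, a node schema declaration and a matching data line can be produced as in the following sketch (value quoting and escaping are omitted):

```python
def schema_decl(prop_names):
    """Node schema declaration: @id|@label|prop1|prop2|..."""
    return "|".join(["@id", "@label"] + list(prop_names))

def data_decl(node_id, labels, values):
    """Matching node data line; lists become comma-separated arrays."""
    def field(x):
        return ",".join(map(str, x)) if isinstance(x, (list, tuple)) else str(x)
    return "|".join([str(node_id), field(labels)] + [field(v) for v in values])
```

For node 1 of Figure 1 this yields `@id|@label|name|age|position` followed by `1|PERSON,EMPLOYEE|John|30|Engineer`.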

The process to create a PGDF file from a property graph G (as defined in Definition 1) is presented in Algorithm 1. First, it iterates through all different node schemas in G (line 2) and generates a schema declaration for each (lines 3–7). Then, the algorithm iterates through the nodes in V that have that schema (line 8). For each such node, the algorithm writes its data declaration consisting of id, labels and property values (lines 9–17). For labels and property values, we use the asArray function, which has three possible cases: if an empty set is given, asArray returns the empty string; if a singleton set is given, asArray returns the string that codifies the single element; if asArray receives a larger set, it returns the strings of each element of the set separated by commas. Then, the same process is repeated for the edges (lines 20–49), where the only difference is that edges have the attributes @dir, @in and @out.
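The three cases of asArray collapse naturally into a single join; a sketch (sorting is our addition, to make the output deterministic for unordered sets):

```python
def as_array(values):
    """asArray: empty set -> empty string; singleton -> its element;
    larger set -> elements separated by commas."""
    items = sorted(str(v) for v in values)  # sort for deterministic output
    return ",".join(items)
```

An empty join gives `""`, a one-element join gives the element itself, and a larger join inserts the commas, covering all three cases described above.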

Algorithm 1. Serialize a Property Graph Into PGDF.

图 8中,我们展示了针对图 1中所示的属性图获得的 PGDF 文件的内容。可以看出,PGDF 使用管道字符 (|) 分隔字段,并使用逗号 () 来分隔同一字段的多个值。当属性图的值包含此类字符时,在实践中会出现问题。解决此问题的方法是允许用户定义引号字符,这样一对引号字符之间的分隔符将被忽略。这是解析字符分隔格式的常见做法。

In Figure 8, we show the content of the PGDF file obtained for the property graph shown in Figure 1. As can be noticed, PGDF uses the pipe character (|) to separate fields, and the comma character (,) to separate multiple values for the same field. A problem arises in practice when the values of the property graph contain such characters. A solution to this problem is to allow users to define a quotation character, such that delimiters found between a pair of quotation characters are ignored. This is common practice in the parsing of character-separated formats.
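This quotation-character convention is the same one implemented by standard CSV parsers; the following sketch shows quote-aware splitting of a PGDF-style line using Python's csv module (the sample value containing a pipe is invented for illustration):

```python
import csv
import io

def split_pgdf_line(line, delimiter="|", quotechar='"'):
    """Split one line on the delimiter, ignoring delimiters inside quotes."""
    return next(csv.reader(io.StringIO(line),
                           delimiter=delimiter, quotechar=quotechar))

# The quoted third field contains a pipe that must NOT split the line.
fields = split_pgdf_line('2|PROJECT|"Project: B|extra"|23/07/01')
```

The pipe inside the quoted field survives as part of the value instead of starting a new field.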

Figure 8. PGDF file for the property graph of Figure 1.

It can be seen that Algorithm 1 can be run in time proportional to |V| + |E|, by first partitioning the nodes and edges with respect to their schema. In practice, it can be expected that nodes and edges with the same schema are stored in the same CSV files and are, therefore, already partitioned. However, we can see here a trade-off between expected file size and execution time. Partitioning nodes and edges according to schema ensures that there is only one schema declaration per schema, which is the minimum possible number. However, since PGDF allows a new schema declaration at any moment, each node and each edge can be serialized in just one pass. The “current” schema can be kept, and if the next node or edge fits the current schema, a data declaration for the node or edge’s information is added. If the next node or edge does not fit the current schema, a schema declaration with the schema of the node or edge is added, and it becomes the new current schema. In the worst case for file size, the current schema changes each time a node or edge is visited, so there is a schema declaration for each data declaration. However, we expect this case to be rare, as we discuss later.
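The one-pass strategy can be sketched as follows (an illustrative rendering of the idea, not the authors' Algorithm 1; nodes are given as (id, labels, properties) triples):

```python
def serialize_one_pass(nodes):
    """One-pass PGDF-style serialization: emit a schema declaration only
    when the schema of the next node differs from the current schema.

    nodes: iterable of (id, labels, props) where props is a dict."""
    lines, current = [], None
    for node_id, labels, props in nodes:
        schema = tuple(props)  # property names, in order
        if schema != current:
            lines.append("|".join(("@id", "@label") + schema))
            current = schema
        lines.append("|".join([str(node_id), ",".join(labels)]
                              + [str(props[p]) for p in schema]))
    return lines
```

If the input happens to arrive grouped by schema, this emits exactly one schema line per schema (the minimum); in the adversarial interleaved case it degrades to one schema line per data line, as discussed above.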

Section V.

Converting CSV to PGDF

CSV (Comma-Separated Values) is a file format which is very popular in data management. Several open datasets are distributed on the Web as CSV (e.g., https://kaggle.com). CSV is used by many graph-oriented software tools to import and export data, including visualization tools and graph-oriented benchmarks (e.g., the LDBC-SNB [7]).

Table 2 shows several graph database systems and the different data formats they support for importing graph data. It can be seen that the most widely supported formats are CSV and JSON. Between CSV and JSON, we note that CSV is usually the more portable format, as the same files can be used in several systems, whereas JSON is usually particular to each system, as each system requires a certain object structure and mandatory properties. In addition, most database systems also allow exporting data as CSV, particularly relational database systems such as PostgreSQL and MySQL.

Table 2. Data formats (X axis) supported by current graph database systems (Y axis).

For all these reasons, we argue that it is fundamental to provide a simple and automatic way to convert CSV data to PGDF. Moreover, there are two good reasons to use PGDF instead of CSV to store and exchange property graphs: 1) PGDF puts all the information in just one file, facilitating the exchange of the data, as well as potentially reducing cumbersome import commands; and 2) PGDF has been designed to serialize arbitrary property graphs with several features.

In the remainder of this section, we introduce a CSV to PGDF conversion method, a tool that implements this method, and later we showcase a use-case example based upon the LDBC social network benchmark (LDBC-SNB).

A. Conversion Method

In this section we describe a method for converting a property graph stored as a set of CSV files to a single PGDF file. When a property graph is exported to CSV, it is often the case that there is a CSV file for each type of node or edge. These “types” usually correspond to the labels of said edges and nodes. Therefore, the input of the CSV-to-PGDF conversion method is a set of CSV files where each node type and edge type is a separate file.

To give some meaning to the input CSV files, we require the creation of a JSON configuration file in which the paths to each individual file are given, such that the user can specify which of these files represent nodes and which represent edges, what the delimiter characters are, and whether each file has a header. Furthermore, the user can specify the names of the columns, as they are required by PGDF.

图 9中,我们展示了 JSON 配置文件的模板。在图中我们可以看到两个根级对象:节点边缘. 在节点对象,我们列出了代表节点的 CSV 文件,对于每个文件,用户必须分配以下属性:

In Figure 9, we show a template of the JSON configuration file. In the figure we can see two root-level objects: nodes and edges. In the nodes object, we list the CSV files that represent nodes where, for each file, the user must assign the following properties:

  1. id, to give an identifier to the CSV file, required to refer to it later;

  2. file, to indicate the location path (on hard disk) to the corresponding CSV file;

  3. delimiter, to indicate the character separating the fields in the CSV file;

  4. header, a boolean value to indicate if the CSV file has a header or not;

  5. labels, a list of labels to assign to the nodes extracted from the CSV file;

  6. properties, the list of mandatory PGDF attributes (e.g., @id) and property names in the order these appear in the columns of the CSV files.

Then, in the edges object, the user must list the CSV files that represent edges where, for each file, the user must assign the following properties:

  1. file, to indicate the location path to the corresponding CSV file;

  2. delimiter, to indicate the character separating the fields in the CSV file;

  3. header, a boolean value to indicate if the CSV file has a header or not;

  4. label, the label to assign to the edges extracted from the CSV file;

  5. dir, which is true or false depending on whether the edges are directed or not;

  6. source, the id of the CSV file that contains the ids of the source nodes associated with the edges in the CSV file;

  7. target, the id of the CSV file that contains the ids of the target nodes associated with the edges contained in the CSV file;

  8. properties, the list of mandatory PGDF attributes (e.g., @in and @out) and property names in the order they appear in the columns of the CSV. The @id property is optional for edges.

Figure 9. Template of the JSON configuration file required to convert CSV files to PGDF.

The template in Figure 9 is used to convert three CSV files to one PGDF file, where two of the CSV files (nodes1.csv and nodes2.csv) store the nodes and the third file (edges.csv) stores the edges.
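A minimal configuration following the template might look as follows (file names, labels and property names are illustrative, not taken from the paper's examples):

```python
import json

# Illustrative configuration for one node file and one edge file,
# using the field names listed above (paths are hypothetical).
config = {
    "nodes": [{
        "id": "persons", "file": "nodes1.csv", "delimiter": ",",
        "header": True, "labels": ["Person"],
        "properties": ["@id", "name", "age"],
    }],
    "edges": [{
        "file": "edges.csv", "delimiter": ",", "header": True,
        "label": "knows", "dir": True,
        "source": "persons", "target": "persons",
        "properties": ["@out", "@in", "since"],
    }],
}
config_text = json.dumps(config, indent=2)
```

Note how the edge entry refers back to the node file through its `id` ("persons"), which is why node files must be given identifiers.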

Then, to convert the set of CSV files to PGDF, we follow Algorithm 1, as each CSV file contains only one type of node or edge schema. First, each CSV file containing nodes is serialized to PGDF. The schema declaration for each node schema is created using the information in its respective JSON object, as defined by the JSON configuration file. Then, each PGDF data line is formatted according to the information of each line in the CSV file. We repeat a similar process for edges.

B. Conversion Tool

The CSV-to-PGDF conversion method described above was implemented in Java, and we encapsulated it as a Java Archive (JAR) file, which serves as the distribution format for our tool, called CSV2PGDF. The tool is publicly available at the following repository: https://github.com/dbgutalca/pgdf.

The CSV2PGDF tool can be executed (e.g. in a Linux command line) by using an instruction of the form:

java -jar CSVConverter pgdf /path/to/JSON /path/to/output/folder

The CSV2PGDF tool requires two parameters: the path to the JSON configuration file, and the path to where the user wishes to store the resulting PGDF file. Notice that the paths to the individual CSV files are read from the JSON configuration file.

C. Use-Case

We now present a use-case example that illustrates how graph data can be converted from its original CSV format to PGDF. To do this, we use the data generator of the LDBC Social Network Benchmark [7], which enables the creation of synthetic property graphs using various scale factors. A generated graph has eight types of nodes: Comment, Forum, Organization, Person, Place, Post, Tag, and TagClass. These nodes are connected through 23 different kinds of edges, which connect nodes of specific types. For example, the edges of type hasModerator connect Forum nodes with Person nodes. A full description of all types of nodes, edges and property names can be found in [7].

The SNB data generator outputs a collection of CSV files. Each CSV file includes data pertaining to either a node type or an edge type. To convert the multiple CSV files to PGDF using the CSV2PGDF conversion tool, we need to define the JSON configuration file, of which we present an extract in Figure 10. In said figure, we only show nodes of types Person, Comment, and Post, with edges of type likes connecting Person with Post and Person with Comment nodes, and edges of type hasCreator connecting Comment and Person nodes. The complete JSON configuration file required to convert all the CSV files produced by the LDBC-SNB generator can be found at the following URL: https://github.com/dbgutalca/pgdf/blob/main/config0.1.json.

Figure 10. Extract of the JSON configuration file to produce PGDF files from data produced with the data generator of the LDBC Social Network Benchmark.

Using the CSV2PGDF conversion tool and the aforementioned JSON configuration file, we can convert the set of CSV files into a single PGDF file. For this experiment, we consider the LDBC-SNB files for several scale factors. In Table 3, we present a comparison between the sum of the sizes of all 31 CSV files of the LDBC-SNB at different scale factors and the resulting PGDF file.1 It can be seen that PGDF files are on average ~27.8% larger than the CSV files, which is mostly due to the repetition of the labels of the nodes and edges in each line of the file. In Section VII, we show that PGDF is still more convenient than other data formats for property graphs, as PGDF is easier to generate and more compact than the alternatives.

Table 3. Information about the LDBC-SNB property graphs used in this paper.

Section VI

Converting Property Graphs to Other Graph Data Formats

In this section, we discuss the serialization of property graphs to the other graph data formats showcased in this paper (GraphML, JSON, GraphSON, and YARS-PG). To do this, we define two conversion methods: one based on main memory (RAM), and one based on secondary memory (hard disk). In the following, we describe these conversion methods and the tools implemented for them.

A. Main Memory-Based Conversion

The input of this method is a set of in-memory objects that model the nodes and edges of the property graph, along with their labels and properties. This representation is then serialized in the destination data format. This method is most suitable for smaller graphs that can fit in memory.
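As a rough illustration, the in-memory representation could be modeled as follows. The class and field names here are our own assumptions for this sketch, not the actual classes used by the tool.

```java
import java.util.*;

// Minimal illustrative in-memory model of a property graph; class and
// field names are assumptions, not the tool's actual implementation.
public class PropertyGraph {
    public static class Node {
        final String id;
        final List<String> labels = new ArrayList<>();
        final Map<String, Object> properties = new HashMap<>();
        Node(String id) { this.id = id; }
    }

    public static class Edge {
        final String sourceId, targetId, label;
        final Map<String, Object> properties = new HashMap<>();
        Edge(String sourceId, String targetId, String label) {
            this.sourceId = sourceId; this.targetId = targetId; this.label = label;
        }
    }

    public final Map<String, Node> nodes = new LinkedHashMap<>();
    public final List<Edge> edges = new ArrayList<>();

    // Add (or fetch) a node by id and attach a label to it.
    public Node addNode(String id, String label) {
        Node n = nodes.computeIfAbsent(id, Node::new);
        n.labels.add(label);
        return n;
    }

    public Edge addEdge(String src, String dst, String label) {
        Edge e = new Edge(src, dst, label);
        edges.add(e);
        return e;
    }
}
```

Once the whole graph is held in objects of this kind, serializing it to any of the target formats is a traversal over `nodes` and `edges`, which is why this method only suits graphs that fit in memory.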

A property graph can be serialized in two ways. The space-efficient way requires partitioning the nodes and edges according to their schema, so that each group can be serialized under a single schema line. However, partitioning can be expensive in practice, so a second conversion method can be considered: we visit each node and edge once, and if the current node or edge has the same schema as the previously processed one, it is serialized directly. If the current node or edge has a different schema, a new schema declaration is introduced first, and then the node or edge is serialized. We call the first method PGDF-srt (sorted), and the second PGDF-unsrt (unsorted).
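The PGDF-unsrt strategy can be sketched as follows. Note that the line syntax used here ("@schema ..." lines and '|' separators) is a simplified placeholder, not the exact PGDF grammar described earlier in the paper.

```java
import java.util.*;

// Sketch of PGDF-unsrt: a new schema line is emitted only when the
// current element's schema differs from the previously processed one.
// "@schema" and '|' are placeholder syntax, not the real PGDF grammar.
public class PgdfUnsrt {

    // Each node is given as {label, comma-separated property names,
    // comma-separated property values}.
    public static List<String> serialize(List<String[]> nodes) {
        List<String> lines = new ArrayList<>();
        String prevSchema = null;
        for (String[] n : nodes) {
            String schema = n[0] + "|" + n[1];
            if (!schema.equals(prevSchema)) {
                lines.add("@schema " + schema);  // new schema declaration
                prevSchema = schema;
            }
            lines.add(n[0] + "|" + n[2]);        // data line
        }
        return lines;
    }
}
```

Under PGDF-srt, the nodes would first be grouped by schema so that each distinct schema line appears exactly once; PGDF-unsrt trades some redundancy (repeated schema lines) for a single pass without partitioning.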

Serialization to YARS-PG is very direct, as each node and each edge, along with their labels and properties, is serialized as a single line. To produce Neo4j-compliant JSON, we follow a process analogous to the one defined for YARS-PG. For this, we add a JSON object for each node and edge. For nodes, we use the property “type”: “node” in the object; for edges, we use “type”: “relationship”, and add nested JSON objects for the source and target node ids of the edge.
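For instance, a node and an edge serialized in this way could look as follows (one JSON object per line). The "type" values are as described above; the remaining field names ("id", "labels", "label", "properties", "start", "end") follow common Neo4j export conventions and are assumptions for this example.

```json
{"type": "node", "id": "1", "labels": ["Person"], "properties": {"name": "Ann"}}
{"type": "relationship", "id": "10", "label": "likes",
 "start": {"id": "1", "labels": ["Person"]},
 "end":   {"id": "7", "labels": ["Post"]},
 "properties": {"creationDate": "2024-10-24"}}
```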

For GraphML, we first need all the property names in the graph. We therefore iterate through all nodes and edges, and add a key tag in the GraphML file for each unique property name. Then, for each node (resp. edge), we complete a node (resp. edge) tag containing the information of that line.
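As an illustration, the resulting GraphML for a tiny graph with one property per element might look like this (the identifiers and values are invented for the example):

```xml
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="name" attr.type="string"/>
  <key id="d1" for="edge" attr.name="creationDate" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n1"><data key="d0">Ann</data></node>
    <node id="n2"/>
    <edge source="n1" target="n2"><data key="d1">2024-10-24</data></edge>
  </graph>
</graphml>
```

The key tags in the preamble explain why the full property-name scan is needed before any node or edge can be written.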

To serialize the graph in GraphSON, we need to keep track of all in-going and out-going edges of each node. In this way, we can serialize each node as a GraphSON object, according to the definition in Section III-E. This suggests that the process of creating a GraphSON file is expensive.
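The bookkeeping this requires can be sketched as follows; the edge representation ({sourceId, targetId, label} triples) is illustrative, not the tool's actual data structure.

```java
import java.util.*;

// Sketch of the bookkeeping needed for GraphSON: before any node can be
// serialized, all of its in-going and out-going edges must be collected.
// Each edge is represented here as {sourceId, targetId, label}.
public class GraphsonIndex {
    public final Map<String, List<String[]>> outEdges = new HashMap<>();
    public final Map<String, List<String[]>> inEdges = new HashMap<>();

    // One pass over the edge list populates both adjacency maps.
    public void index(List<String[]> edges) {
        for (String[] e : edges) {
            outEdges.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e);
            inEdges.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e);
        }
    }
}
```

With such an index built in one pass, each node's edges can be looked up directly. In the disk-based setting of Section VI-B, where this index does not fit in memory, the edge files must instead be re-read for each node, which is what makes GraphSON conversion costly.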

B. Disk Conversion Method

We also define a procedure to convert a property graph stored in several CSV files to the other data formats. The process to convert CSV to PGDF was presented above, where a JSON configuration file (see Figure 9) is required to define some conversion parameters.

To convert CSV to YARS-PG, we first go through all the files containing nodes. For each file, we process each line and, using the JSON configuration, craft the serialization of each node, extracting the id, labels, property names, and property values from both the JSON configuration and the current CSV line. For edges, we follow the same process.

A similar method can be followed to generate JSON, where a node or relationship object is created for each line in each node/edge CSV file, considering the JSON configuration file.

To serialize the CSV files into GraphML, we first get all the distinct node and edge property names from the JSON configuration file to create the preamble <key> tags. Then, we create the <node> and <edge> objects similarly as we do for PGDF, JSON and YARS-PG.

Finally, converting CSV to GraphSON is less direct than converting to the other data formats. For each line of each CSV file containing nodes, all files containing edges must be read to obtain the node’s in-going and out-going edges. We make use of the id attribute in the JSON configuration file, so we read only the edge files that declare the id of the node being processed as source or target.

Conversion from CSV to PGDF, YARS-PG, JSON, and GraphML can be done in O(|V|+|E|) time, whereas conversion to GraphSON requires O(|V||E|) time.

C. Implementation

The conversion methods described above were implemented as a Java application and distributed as a JAR file. The code can be found in the GitHub repository of the paper. To execute the tool, we use an instruction of the form:

java -jar CSVConverter [--memory]

graphml|json|graphson|yarspg

/path/to/JSON/configuration

/path/to/destination/folder

where the user must include the desired output format (GraphML, JSON-Neo4j, GraphSON, or YARS-PG), the path to the JSON configuration file, and the path to the desired output file. The optional parameter --memory activates the in-memory conversion.

Section VII

Experimental Evaluation

In this section, we compare PGDF with other graph data formats in terms of output size and runtime. For this experimental evaluation, we use data produced with the data generator of the LDBC Social Network Benchmark (LDBC-SNB) [7]. We chose this generator because not many real graphs are available, and it allows us to generate graphs of different sizes that present rich connections. Some real graphs, like the Panama Papers dataset, can also be seen as relational data, as there is not a rich connection dynamic among the edges of the graph (i.e., its topology is similar to a tree).

In Table 3, we present information about the graphs used in the experiments, including their scale factor and the size of the generated CSV files. The experiments were run on two Google Cloud virtual machines with different numbers of CPUs and amounts of RAM. Each machine has a Debian GNU/Linux 11 OS with 2-core Intel(R) Xeon(R) CPUs @ 2.20GHz and a 500 GB SSD. The first machine, which we call M1, has 2 CPUs and 8 GB RAM. The second machine, M2, has 4 CPUs and 16 GB RAM. To compile and run the code, we used OpenJDK 17.0.9 and Maven 3.6.3.

A. Comparison of File Size

First, we compare the sizes of the files produced for PGDF, YARS-PG, JSON and GraphML. We do not report results for GraphSON serialization, as it is excessively time-consuming to produce. Indeed, an attempt to generate GraphSON from G1 was executed for three hours and still did not finish.

The sizes of the files generated for the compared graph formats are presented in Table 4. Additionally, we depict these results in the chart displayed in Figure 11, where the Y-axis is in log scale. In both Table 4 and Figure 11, we can see that PGDF always uses less disk space than the other data formats. YARS-PG performs slightly worse than PGDF, as this format introduces ASCII-art characters for edges, as well as a JSON-like representation of the property values, all of which use more characters, and it repeats property names in each line. Both GraphML and JSON perform much worse, as these formats require several delimiter and quotation characters that are used by neither PGDF nor YARS-PG.

Table 4. File sizes of the LDBC-SNB property graphs generated with different scale factors and serialized in the different data formats (PGDF, YARS-PG, GraphML, and JSON-Neo4j).
Figure 11. The sizes of the files per graph per format.

B. Comparison of Conversion Time

Next, we evaluate the time required to convert property graphs stored in different manners to each data format: PGDF, YARS-PG, JSON-Neo4j, and GraphML. We particularly focus on the memory-based and disk-based conversion methods defined above. We apply these conversion methods to the property graphs G1–6 presented in Table 3. For each conversion method, data format, graph, and machine, we report the execution time. Conversion was performed using our CSVConverter tool.

For the memory-based conversion, we first read the CSV files of G1–6 and load them into memory using Java objects. Then, serializations of the in-memory graph are created according to the rules of each format. In Table 5, we present the execution time for each machine, graph, and data format. These times are also presented in Figures 12a and 12b for machines M1 and M2, respectively (the Y-axes are in log scale). Considering M1, we can see that only G1 and G2 could be converted before Java ran out of heap space. With M2, we were able to convert G3. It can be seen that PGDF is faster to produce than the alternatives, taking around 70% of the time required to produce YARS-PG. Note that we used the PGDF-unsrt strategy for this conversion. In terms of file size, files produced with PGDF-unsrt are smaller than the ones produced with YARS-PG, despite PGDF-unsrt producing repeated schema declarations. The file sizes obtained with PGDF-unsrt are 89 MB for G1, 270 MB for G2, and 1019 MB for G3.

Table 5. Execution times for the memory-based conversion.
Figure 12. The execution times for the memory-based conversion.

In the disk-based conversion case, we directly convert the CSV files to the considered data formats. The resulting conversion times (in seconds) are shown in Table 6. These times are also presented in Figures 13a and 13b, where the Y-axis is in log scale. It can be noticed that PGDF is always the fastest output to produce. This is because the fields of the CSV file, after being separated, do not require much post-processing to be re-written as PGDF. In contrast, YARS-PG, GraphML, and JSON require combining each CSV line with the fields declared in the JSON configuration file, which is not needed for PGDF. Furthermore, GraphML takes extra time, as it needs to create the key tags at the start of the file with the different property names present in the graph. There is no significant difference in execution time between the two machines used for the experiment, probably because the most time-consuming part of the conversion is I/O operations.

Table 6. Execution times for the disk-based conversion.
Figure 13. The execution times for the disk-based conversion.

Section VIII

Conclusion

In this paper, we presented PGDF, a text-based data format for serializing property graphs. We showed that PGDF is simple, flexible, expressive, and efficient. We described an algorithm for serializing any property graph to PGDF and a Java tool that implements this algorithm. Furthermore, we defined and implemented various conversion methods from property graphs to YARS-PG, GraphML, and JSON-Neo4j. Our experimental evaluation shows that PGDF uses less disk space and is much faster in producing output than the other formats.

Declaration of Generative AI and AI-Assisted Technologies in the Writing Process

During the preparation of this work the author(s) used AI-PRO Grammar AI in order to improve language and readability. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
